Graph-based multi-modality integration for prediction of cancer subtype and severity

Personalized cancer screening before therapy paves the way toward improving diagnostic accuracy and treatment outcomes. Most approaches are limited to a single data type and do not consider interactions between features, leaving aside the complementary insights that multimodality and systems biology can provide. In this project, we demonstrate the use of graph theory for data integration via individual networks where nodes and edges are individual-specific. We showcase the consequences of early, intermediate, and late graph-based fusion of RNA-Seq data and histopathology whole-slide images for predicting cancer subtypes and severity. The methodology developed is as follows: (1) we create individual networks; (2) we compute the similarity between individuals from these graphs; (3) we train our model on the similarity matrices; (4) we evaluate the performance using the macro F1 score. Pros and cons of elements of the pipeline are evaluated on publicly available real-life datasets. We find that graph-based methods can increase performance over methods that do not study interactions. Additionally, merging multiple data sources often improves classification compared to models based on a single data modality, especially through intermediate fusion. The proposed workflow can easily be adapted to other disease contexts to accelerate and enhance personalized healthcare.

Relevance of differentiating Gleason score 3 versus 4
The relatively benign nature of homogeneous, low-volume Gleason 3 tumors stands in contrast to the progressive risk of biochemical recurrence and prostate cancer-specific mortality associated with increasing quantities of Gleason 4 components [5]. Notably, these differences underscore the existence of distinct cancer diatheses, each demanding tailored approaches. Furthermore, tumors with Gleason score 3+4 or Gleason score 4+3 are characterized by significant heterogeneity in their biological behavior [2]. The prognosis for patients with Gleason score 3+4 and 4+3 tumors at radical prostatectomy exhibits notable differences.

Model comparison
We compared our graph-based approach and its variants to several classification methods applied to the raw features. Data were pre-processed as in the graph approach, i.e. the same variables were selected (Section 3.1). For the penalized logistic regression, we used the function cv.glmnet from the package glmnet [3] with options alpha = 1 and lambda = NULL. For the random forest, we applied the function randomForest from the package randomForest [6] with option ntree = 500. For AdaBoost, we used the function boosting from the package adabag [1] with options boos = TRUE and mfinal = 50. For the classification tree, we applied the function rpart from the package rpart [8] with the default options. For the naive Bayes approach, we used the function naiveBayes from the package e1071 [7] with the default options. Finally, we applied a neural network using the neuralnet function from the neuralnet R package [4]. The network consists of two hidden layers, and we set the number of neurons per layer to √(# nodes in the previous layer × # nodes in the output layer). We set the parameter linear.output = FALSE and used the default options otherwise.
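As an illustration of the hidden-layer sizing rule, the following minimal Python sketch computes the layer widths for a network with two hidden layers. It assumes that the square root covers the whole product and that widths are rounded to the nearest integer with a floor of one neuron; neither detail is stated in the text, and the function name and input sizes are hypothetical.

```python
import math


def hidden_layer_sizes(n_inputs, n_outputs, n_hidden_layers=2):
    """Width of each hidden layer: sqrt(previous layer size * output layer size)."""
    sizes = []
    previous = n_inputs
    for _ in range(n_hidden_layers):
        # Assumption: round to the nearest integer and keep at least one neuron.
        width = max(1, round(math.sqrt(previous * n_outputs)))
        sizes.append(width)
        previous = width  # the next layer is sized from this one
    return sizes


# Hypothetical example: 400 selected features, one output node.
sizes = hidden_layer_sizes(n_inputs=400, n_outputs=1)
print(sizes)
```

In R, a vector of this form would be what is passed to the hidden argument of neuralnet.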

Effect of data imbalance
While the workflow does not include a preprocessing step applied directly to the input data, such as under- or over-sampling, it incorporates strategies aimed at alleviating the impact of data imbalance. Firstly, the dataset-specific feature selection step uses evaluation metrics designed for imbalanced data. Indeed, to determine the optimal feature selection thresholds, we conducted a stratified 5-fold cross-validation within the training set, choosing the parameters that yielded the highest average macro F1 score. This approach ensures that the feature selection process focuses on maintaining a balance between classes. Furthermore, when tuning hyperparameters for the Support Vector Machine, we also rely on cross-validation evaluated using the macro F1 score. This fine-tunes our models to perform well in imbalanced settings. An alternative option would have been to employ a class-weighted SVM, which addresses unbalanced data by assigning higher misclassification penalties to training instances of the minority class.
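The two ingredients discussed above, stratified fold assignment and class weighting, can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation used in the workflow: the function names and the toy class sizes are hypothetical, and the weighting n / (k × n_c) is one common "balanced" heuristic, assumed here because the text does not specify a weighting scheme.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each sample to a fold so that class proportions are preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [None] * len(labels)
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the samples of this class round-robin across the folds.
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_folds
    return fold_of


def balanced_class_weights(labels):
    """Assumed heuristic: weight_c = n_samples / (n_classes * n_samples_in_c)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


# Hypothetical imbalanced training set: 40 samples versus 10 samples.
labels = ["group1"] * 40 + ["group2"] * 10
folds = stratified_folds(labels)
weights = balanced_class_weights(labels)
print(Counter(zip(folds, labels)))
print(weights)
```

With this heuristic the minority class receives the larger weight, which corresponds to the higher misclassification penalty a class-weighted SVM would apply to it.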
To provide a more comprehensive evaluation of the models' performance and assess potential bias towards specific classes, the class-wise F1 scores are summarized in Table 2. With the graph-based approach, the difference between the two F1 scores stands at 3.3% on average. In two of the nine analyses (columns), the F1 score is even equal between the two classes. The largest discrepancy (11%) is observed in the context of prostate cancer classification based on RNASeq data. Overall, these findings indicate that there is no substantial disparity between the F1 scores achieved in the two groups, and suggest that the approach adeptly handles data imbalance.
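For reference, the class-wise F1 scores, their unweighted mean (the macro F1 score), and the gap between the two classes can be computed as in the Python sketch below. The labels and predictions are toy values chosen for illustration only and are not taken from Table 2; the function names are hypothetical.

```python
def class_wise_f1(y_true, y_pred):
    """F1 score of each class, treating that class as the positive one."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # Convention assumed here: F1 is 0 when the class is never seen or predicted.
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the class-wise F1 scores."""
    scores = class_wise_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy data: 6 samples of group1 and 4 of group2, one error in each group.
y_true = ["group1"] * 6 + ["group2"] * 4
y_pred = ["group1"] * 5 + ["group2"] + ["group2"] * 3 + ["group1"]

scores = class_wise_f1(y_true, y_pred)
gap = abs(scores["group1"] - scores["group2"])
print(scores, macro_f1(y_true, y_pred), gap)
```

Because the macro F1 score weights both classes equally regardless of their size, a model that neglects the minority class is penalized as strongly as one that neglects the majority class.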

Fig. 1: (a) Multi-modality fusion from Person-to-Person networks. Nodes are individuals and edges show how close two individuals are. (b) Individual network. Nodes and/or edges are individual-specific.

Fig. 2: Overview of the workflow variations evaluated. The rows describe the data fusion methods, and the columns show the information used to build the Person-to-Person Network.

Fig. 3: Visualization of the PPNs used as input for SVM-based prostate cancer subtype prediction, created using UMAP. The first three rows refer to approaches based on the nodes of the individual networks. Rows 4 and 5 use the edge weights of the individual networks. Rows 6 and 7 combine individual nodes and edges. The first two columns focus on a single data modality. Columns 3 to 5 refer to data integration.

Fig. 4: Visualization of the PPNs used as input for SVM-based brain cancer subtype prediction, created using UMAP. The first three rows refer to approaches based on the nodes of the individual networks. Rows 4 and 5 use the edge weights of the individual networks. Rows 6 and 7 combine individual nodes and edges. The first two columns focus on a single data modality. Columns 3 to 5 refer to data integration.

Fig. 5: Visualization of the PPNs used as input for SVM-based lung cancer subtype prediction, created using UMAP. The first three rows refer to approaches based on the nodes of the individual networks. Rows 4 and 5 use the edge weights of the individual networks. Rows 6 and 7 combine individual nodes and edges. The first two columns focus on a single data modality. Columns 3 to 5 refer to data integration.

Table 1: Cancer types used for histopathology data feature extraction.

Table 2: Class-wise F1 scores (%), i.e. F1(group 1)/F1(group 2), for the different inputs (RNASeq data only, histopathology images only, or fusion of the two modalities) and algorithms evaluated.